V0 version of embedding ingestion core.#1964
V0 version of embedding ingestion core.#1964shixiao-coder wants to merge 12 commits intodatacommonsorg:masterfrom
Conversation
It includes: requirements.txt to install the related packages embedding_utils.py including function to: - read the latest timestamp for the lock, if no timestamp set None - function to find Node IDs to update based on timestamp, nodetype, validity by none empty - function to convert the Node and fields to ID and embedding_content - function to generate embeddings from the ID and embedding content in batch
There was a problem hiding this comment.
Code Review
This pull request introduces a Cloud Function and helper utilities for automating node embedding workflows using Google Cloud Spanner. Key features include fetching updated nodes based on ingestion locks and processing embeddings in batches via ML.PREDICT. Feedback identifies critical issues such as a missing time import and the use of unsupported INSERT OR UPDATE syntax in Spanner SQL. Other improvements include correcting a typo in the initialization action string, removing duplicate imports and redundant initializations, and optimizing result set processing.
…st on a request of 250. If the batch is smaller than 250 and or not divisible by 250. It actually send more requests to Embeddings models, with each batch containing a much smaller number. Batch from 100 -> 500 changes QPM usage from 1000 to 700 Timeout is set since each request is now containing 250 data and will run longer
…loud run Running with experiment deployed image and confirmed proper ingestion
gmechali
left a comment
There was a problem hiding this comment.
Thanks Xiao, this looks great, left a coupe minor comments + one on the query and maybe how to optimize it.
My main feedback is that we need tests to go in this PR too! Can you add a tests/ directory and add unit tests for embedding_utils and e2e tests for main.py?
| Yields: | ||
| Dictionaries containing subject_id and name. | ||
| """ | ||
| timestamp_condition = "update_timestamp > @timestamp" if timestamp else "TRUE" |
There was a problem hiding this comment.
Just to double check, did you get approval from @keyurva and the data team to make this change to the schema to support the timestamp on the Node Table?
Updating header message for code
Added @gmechali |
It includes:
requirements.txt to install the related packages
placeholder for main.py function: placeholder to trigger various embedding ingestion logic
embedding_utils.py including function to: